SETUP

FXN: Save graphical outputs

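library(tidyverse)     # ggplot2 (ggsave, mpg, diamonds) and dplyr (filter, arrange, select, mutate, %>%)
library(nycflights13)  # flights data used in the Chapter 5 problems

# Save a plot as a 12 x 8 inch PDF in the week 3 pics folder
picsave <- function(graph, name) {
  ggsave(plot = graph, filename = name, device = "pdf", width = 12, height = 8,
         path = "~/GitHub/S_Lipkind_Rundergrad2020/week3/pics/")
}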

PROBLEMS

CHAPTER 4 PROBLEMS

4.4

  1. Why does this code not work?
my_variable <- 10
my_varıable
## [1] 10
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found

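The second line spells the name with a dotless ı (U+0131) instead of i, so R treats it as a different object name, one that was never assigned. Was this written by someone who speaks Turkish? I'm not sure how else someone would type ı instead of i.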

  2. Tweak each of the following R commands so that they run correctly:
a <- ggplot(data = mpg) + #dota -> data
  geom_point(mapping = aes(x = displ, y = hwy))
#picsave(a, "4.4.2 graph.pdf")

filter(mpg, cyl == 8) #= -> ==
filter(diamonds, carat > 3) # diamond -> diamonds
  3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

Oh my goodness, that is amazing. An entire list of keyboard shortcuts. You could also reach that page by going to Help > Keyboard Shortcuts Help.

CHAPTER 5 PROBLEMS

5.2.4

Find all flights that…

#### 1.1: had an arrival delay of two or more hours

head(flights)
filter(flights, arr_delay >= 120) # arr_delay is in minutes, so two hours = 120

#### 1.4: Departed in summer (July, August, and September)

filter(flights, month %in% c(7,8,9))

#### 1.5: Arrived more than two hours late, but didn’t leave late

filter(flights, arr_delay > 120, dep_delay <= 0) # delays are in minutes; "didn't leave late" includes early departures

#### 1.7: Departed between midnight and 6am (inclusive)

(d <- filter(flights, dep_time <= 600 | dep_time == 2400)) # midnight is recorded as 2400, not 0

#### 2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

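?between

between() is an inclusive shortcut for x >= left & x <= right: it returns TRUE for values that fall within the range, endpoints included.

e <- filter(flights, between(dep_time, 0, 600))
# between() already returns a logical vector, so it goes straight into filter() without %in%.
# Same rows as filter(flights, dep_time >= 0, dep_time <= 600); departures recorded as 2400 still need their own condition.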
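
To see the inclusive behaviour on its own, here is a minimal sketch on a made-up vector of departure times (dep is a hypothetical name, not a column taken from flights):

```r
library(dplyr)

# Made-up departure times in HHMM format, for illustration only
dep <- c(0, 1, 600, 601, NA)

# between(x, left, right) is shorthand for x >= left & x <= right
in_range <- between(dep, 0, 600)
in_range

# The long form gives the same logical vector, including NA for missing input
identical(in_range, dep >= 0 & dep <= 600)
```

Both endpoints (0 and 600) count as inside the range, and a missing value stays missing rather than becoming FALSE.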

#### 3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

filter(flights, is.na(dep_time)) # 8255 rows/flights
#these observations also all have NA dep_delay, arr_time, arr_delay.
#Hypothesis: cancelled flights

5.3.1

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

arrange(flights, desc(is.na(dep_delay))) #this works, but is it the intended solution?
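
A minimal sketch on a toy tibble (the data and the column name x are made up) suggests this is at least a reasonable approach: arrange() normally puts NA last, and sorting on desc(is.na(x)) first moves the missing rows to the top, while a second key orders the rest.

```r
library(dplyr)

# Toy data with two missing values, for illustration only
toy <- tibble(x = c(3, NA, 1, NA, 2))

# is.na(x) is TRUE for missing rows; desc() sorts TRUE before FALSE,
# and the second key orders the remaining rows by x
na_first <- arrange(toy, desc(is.na(x)), x)
na_first
```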

Sort flights to find the most delayed flights. Find the flights that left earliest.

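arrange(flights, desc(dep_delay)) # most delayed flights first
arrange(flights, dep_delay)       # flights that left earliest relative to schedule first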

Sort flights to find the fastest (highest speed) flights.

y <- arrange(flights, desc(distance / air_time)) # speed = distance / air_time (miles per minute)
#(select(y, distance, air_time)) -> double-checking

Which flights travelled the farthest? Which travelled the shortest?

(longest_distance <- top_n(flights, 10, distance))
(shortest_distance <- top_n(flights, -10, distance))

5.4.1

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

#one option: select()
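
A few more options, sketched on a toy stand-in for flights (the toy tibble and its values are made up; the same calls should pick the same four columns from flights, since no other column there starts with dep_ or arr_):

```r
library(dplyr)

# Toy stand-in for flights with only a few columns (values are made up)
toy <- tibble(
  year = 2013L, dep_time = 517L, dep_delay = 2,
  arr_time = 830L, arr_delay = 11, carrier = "UA"
)

wanted <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

by_name   <- select(toy, dep_time, dep_delay, arr_time, arr_delay)          # bare names
by_string <- select(toy, "dep_time", "dep_delay", "arr_time", "arr_delay")  # quoted names
by_prefix <- select(toy, starts_with("dep_"), starts_with("arr_"))          # prefix helpers
by_regex  <- select(toy, matches("^(dep|arr)_(time|delay)$"))               # regular expression
by_vector <- select(toy, all_of(wanted))                                    # character vector

names(by_prefix)
```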

What happens if you include the name of a variable multiple times in a select() call?

select(flights, day, month, day, month, dep_delay, dep_delay)

Repeating a variable name makes no difference: select() keeps each column once, at the position where its name first appears in the call.

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay") 
select(flights, one_of(vars))

one_of() takes a character vector of column names and selects the columns that match, which is handy when the names are already stored in a variable such as vars.

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

The results aren’t too surprising, though I didn’t realize the select helpers ignore case by default. If I wanted matching to be case-sensitive, I could set ignore.case = FALSE as below:

select(flights, contains("time", ignore.case = FALSE))

5.5.2

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

head(flights)
minutes <- function(miltime) {
  miltime %/% 100 * 60 + miltime %% 100
}

mod <- flights %>%
  select(dep_time,sched_dep_time) %>%
  lapply(.,minutes) %>%
  as_tibble() %>%
  rename(
    mod_dep_time = dep_time,
    mod_sched_dep_time = sched_dep_time)

modflights <- flights %>% #I initially tried to join flights and mod together, but I couldn't figure out the 'by' part.
  mutate(
    mod_dep_time = mod$mod_dep_time,
    mod_sched_dep_time = mod$mod_sched_dep_time)
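
No join is actually needed here: because the conversion is vectorised, the new columns can be created in a single mutate(). The sketch below uses a made-up toy tibble and a hypothetical helper name, minutes_since_midnight(), which adds one assumption of my own: a final %% 1440 so that midnight, recorded as 2400, becomes 0 rather than 1440.

```r
library(dplyr)

# Convert HHMM clock times to minutes since midnight.
# The final %% 1440 maps 2400 (midnight) to 0 instead of 1440.
minutes_since_midnight <- function(miltime) {
  (miltime %/% 100 * 60 + miltime %% 100) %% 1440
}

# Toy stand-in for flights (values are made up)
toy <- tibble(
  dep_time       = c(517L, 2400L, NA),
  sched_dep_time = c(515L, 2359L, 600L)
)

# Add the converted columns directly, keeping the originals alongside
modtoy <- toy %>%
  mutate(
    mod_dep_time = minutes_since_midnight(dep_time),
    mod_sched_dep_time = minutes_since_midnight(sched_dep_time))
modtoy
```

Missing departure times stay NA, so cancelled flights are not silently turned into a number.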

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

discrepancy <- flights %>%
  mutate(
    # convert each clock time to minutes first; subtracting raw HHMM values does not give a duration
    arr_dep = minutes(arr_time) - minutes(dep_time)) %>%
  select(arr_dep, air_time) %>% # air_time is already in minutes, so it is left alone
  glimpse(.)
#Theoretically, you would expect the values to be the same, but they are not.
#Perhaps there's an issue as a result of military time? After converting both clock times to minutes, though, a difference remains. The first flight, for example, shows 193 minutes between departure and arrival but 227 minutes of air time.
#Hypothesis: The difference in time is explained by the time the planes take to embark and disembark.
#How to test: add dep_delay and arr_delay to air_time, see if the sum == arr_dep.

head(flights)
flights %>% 
    mutate(
      air_time_delay = air_time + arr_delay + dep_delay,
      # clock times are converted to minutes before subtracting; the delay columns are already in minutes
      arr_dep = minutes(arr_time) - minutes(dep_time)) %>%
    select(air_time, air_time_delay, arr_dep, dep_delay, arr_delay) %>%
    glimpse(.)
#However, that doesn't work, either: the difference between arr_dep and air_time is not explained neatly by either dep_delay or arr_delay.
#A likelier cause (not checked here): dep_time and arr_time are local clock times, so a change of time zone or an arrival after midnight distorts arr_dep, and air_time leaves out time spent taxiing.

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

#solution 1: top_n()

flights %>%
  mutate(
    delays4dayz = dep_delay + arr_delay) %>%
  top_n(.,10, delays4dayz) %>%
  glimpse(.)
## Observations: 10
## Variables: 20
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month          <int> 1, 1, 12, 3, 4, 5, 6, 7, 7, 9
## $ day            <int> 9, 10, 5, 17, 10, 3, 15, 22, 22, 20
## $ dep_time       <int> 641, 1121, 756, 2321, 1100, 1133, 1432, 845, 2257, 1139
## $ sched_dep_time <int> 900, 1635, 1700, 810, 1900, 2055, 1935, 1600, 759, 1845
## $ dep_delay      <dbl> 1301, 1126, 896, 911, 960, 878, 1137, 1005, 898, 1014
## $ arr_time       <int> 1242, 1239, 1058, 135, 1342, 1250, 1607, 1044, 121, ...
## $ sched_arr_time <int> 1530, 1810, 2020, 1020, 2211, 2215, 2120, 1815, 1026...
## $ arr_delay      <dbl> 1272, 1109, 878, 915, 931, 875, 1127, 989, 895, 1007
## $ carrier        <chr> "HA", "MQ", "AA", "DL", "DL", "MQ", "MQ", "MQ", "DL"...
## $ flight         <int> 51, 3695, 172, 2119, 2391, 3744, 3535, 3075, 2047, 177
## $ tailnum        <chr> "N384HA", "N517MQ", "N5DMAA", "N927DA", "N959DL", "N...
## $ origin         <chr> "JFK", "EWR", "EWR", "LGA", "JFK", "EWR", "JFK", "JF...
## $ dest           <chr> "HNL", "ORD", "MIA", "MSP", "TPA", "ORD", "CMH", "CV...
## $ air_time       <dbl> 640, 111, 149, 167, 139, 112, 74, 96, 109, 354
## $ distance       <dbl> 4983, 719, 1085, 1020, 1005, 719, 483, 589, 762, 2586
## $ hour           <dbl> 9, 16, 17, 8, 19, 20, 19, 16, 7, 18
## $ minute         <dbl> 0, 35, 0, 10, 0, 55, 35, 0, 59, 45
## $ time_hour      <dttm> 2013-01-09 09:00:00, 2013-01-10 16:00:00, 2013-12-0...
## $ delays4dayz    <dbl> 2573, 2235, 1774, 1826, 1891, 1753, 2264, 1994, 1793...
#If one were to instead use min_rank(), tied values would all receive the same rank, the smallest of the positions they occupy (e.g. 1, 2, 2, 4), so filtering on rank <= 10 keeps every flight tied for tenth place.
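
As a rough illustration of the min_rank() route, here is a sketch on a made-up table of five delays with a tie at the top (toy, flight and delay are hypothetical names; on flights the same pattern would rank dep_delay or a combined delay column):

```r
library(dplyr)

# Toy delays with a tie (values are made up)
toy <- tibble(
  flight = c("A", "B", "C", "D", "E"),
  delay  = c(50, 90, 90, 10, 70)
)

ranked <- toy %>%
  mutate(delay_rank = min_rank(desc(delay))) %>%  # 1 = most delayed
  filter(delay_rank <= 2) %>%                     # top 2, keeping ties
  arrange(delay_rank)
ranked
```

Flights B and C share rank 1 and the next flight gets rank 3, so nothing is ranked 2 and the filter returns exactly the two tied rows.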

What does 1:3 + 1:10 return? Why?

1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
##  [1]  2  4  6  5  7  9  8 10 12 11
#Returns:   2  4  6  5  7  9  8 10 12 11

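Because 1:3 is shorter than 1:10, R recycles it: 1:3 “loops” as it adds along 1:10. The warning appears because 10 is not a multiple of 3, so the last cycle is cut short.
(1 2 3 1 2 3 1 2 3 1) + (1 2 3 4 5 6 7 8 9 10)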
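
The recycling can be spelled out with rep_len(), which makes the "looping" explicit. This is only a sketch of what R does implicitly; short and long are names chosen for the illustration.

```r
short <- 1:3
long  <- 1:10

# rep_len() repeats the shorter vector until it matches the longer one
recycled <- rep_len(short, length(long))
recycled

# Adding the explicitly recycled vector gives the same result as 1:3 + 1:10,
# without the warning about mismatched lengths
explicit <- recycled + long
implicit <- suppressWarnings(short + long)
identical(explicit, implicit)
```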

What trigonometric functions does R provide?

sin(pi)
## [1] 1.224606e-16
cos(pi)
## [1] -1
tan(pi)
## [1] -1.224647e-16
#sec(pi) Not included
#csc(pi) Not included
#cot(pi) Not included
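
Besides sin(), cos() and tan(), base R also has the inverse functions asin(), acos(), atan() and atan2(). The reciprocal functions are easy to define by hand; the helpers below are my own hypothetical definitions, not part of R.

```r
x <- pi / 4

# Not built in: the reciprocal functions, written here as small helpers
sec <- function(x) 1 / cos(x)
csc <- function(x) 1 / sin(x)
cot <- function(x) 1 / tan(x)

sec(x)
csc(x)
cot(x)

# The tiny nonzero value of sin(pi) is floating-point error;
# sinpi(x) computes sin(pi * x) and avoids it for whole numbers
sinpi(1)
```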

5.6.7

5.